54 research outputs found
A Systematic Evaluation of Large Language Models on Out-of-Distribution Logical Reasoning Tasks
Large language models (LLMs), such as GPT-3.5 and GPT-4, have greatly
advanced the performance of artificial systems on various natural language
processing tasks to human-like levels. However, their generalisation and
robustness to perform logical reasoning remain under-evaluated. To probe this
ability, we propose three new logical reasoning datasets named "ReClor-plus",
"LogiQA-plus" and "LogiQAv2-plus", each featuring three subsets: the first with
randomly shuffled options, the second with the correct choices replaced by
"none of the other options are correct", and a combination of the previous two
subsets. We carry out experiments on these datasets with both discriminative
and generative LLMs and show that these simple tricks greatly hinder the
performance of the language models. Despite their superior performance on the
original publicly available datasets, we find that all models struggle to
answer our newly constructed datasets. We show that introducing task variations
by perturbing a sizable training set can markedly improve the model's
generalisation and robustness in logical reasoning tasks. Moreover, applying
logic-driven data augmentation for fine-tuning, combined with prompting can
enhance the generalisation performance of both discriminative large language
models and generative large language models. These results offer insights into
assessing and improving the generalisation and robustness of large language
models for logical reasoning tasks. We make our source code and data publicly
available
\url{https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning}.Comment: Accepted for oral presentation at the LLM@IJCAI 2023 non-archival
symposiu
ProQA: Structural Prompt-based Pre-training for Unified Question Answering
Question Answering (QA) is a longstanding challenge in natural language
processing. Existing QA works mostly focus on specific question types,
knowledge domains, or reasoning skills. The specialty in QA research hinders
systems from modeling commonalities between tasks and generalization for wider
applications. To address this issue, we present ProQA, a unified QA paradigm
that solves various tasks through a single model. ProQA takes a unified
structural prompt as the bridge and improves the QA-centric ability by
structural prompt-based pre-training. Through a structurally designed
prompt-based input schema, ProQA concurrently models the knowledge
generalization for all QA tasks while keeping the knowledge customization for
every specific QA task. Furthermore, ProQA is pre-trained with structural
prompt-formatted large-scale synthesized corpus, which empowers the model with
the commonly-required QA ability. Experimental results on 11 QA benchmarks
demonstrate that ProQA consistently boosts performance on both full data
fine-tuning, few-shot learning, and zero-shot testing scenarios. Furthermore,
ProQA exhibits strong ability in both continual learning and transfer learning
by taking the advantages of the structural prompt.Comment: NAACL 202
Aligning Large Language Models with Human: A Survey
Large Language Models (LLMs) trained on extensive textual corpora have
emerged as leading solutions for a broad array of Natural Language Processing
(NLP) tasks. Despite their notable performance, these models are prone to
certain limitations such as misunderstanding human instructions, generating
potentially biased content, or factually incorrect (hallucinated) information.
Hence, aligning LLMs with human expectations has become an active area of
interest within the research community. This survey presents a comprehensive
overview of these alignment technologies, including the following aspects. (1)
Data collection: the methods for effectively collecting high-quality
instructions for LLM alignment, including the use of NLP benchmarks, human
annotations, and leveraging strong LLMs. (2) Training methodologies: a detailed
review of the prevailing training methods employed for LLM alignment. Our
exploration encompasses Supervised Fine-tuning, both Online and Offline human
preference training, along with parameter-efficient training mechanisms. (3)
Model Evaluation: the methods for evaluating the effectiveness of these
human-aligned LLMs, presenting a multifaceted approach towards their
assessment. In conclusion, we collate and distill our findings, shedding light
on several promising future research avenues in the field. This survey,
therefore, serves as a valuable resource for anyone invested in understanding
and advancing the alignment of LLMs to better suit human-oriented tasks and
expectations. An associated GitHub link collecting the latest papers is
available at https://github.com/GaryYufei/AlignLLMHumanSurvey.Comment: work in progres
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Evaluating the general abilities of foundation models to tackle human-level
tasks is a vital aspect of their development and application in the pursuit of
Artificial General Intelligence (AGI). Traditional benchmarks, which rely on
artificial datasets, may not accurately represent human-level capabilities. In
this paper, we introduce AGIEval, a novel benchmark specifically designed to
assess foundation model in the context of human-centric standardized exams,
such as college entrance exams, law school admission tests, math competitions,
and lawyer qualification tests. We evaluate several state-of-the-art foundation
models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark.
Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math
competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5%
accuracy on the English test of the Chinese national college entrance exam.
This demonstrates the extraordinary performance of contemporary foundation
models. In contrast, we also find that GPT-4 is less proficient in tasks that
require complex reasoning or specific domain knowledge. Our comprehensive
analyses of model capabilities (understanding, knowledge, reasoning, and
calculation) reveal these models' strengths and limitations, providing valuable
insights into future directions for enhancing their general capabilities. By
concentrating on tasks pertinent to human cognition and decision-making, our
benchmark delivers a more meaningful and robust evaluation of foundation
models' performance in real-world scenarios. The data, code, and all model
outputs are released in https://github.com/microsoft/AGIEval.Comment: 19 page
Exploring Self-Reinforcement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models
Learnersourcing involves students generating and sharing learning resources
with their peers. When learnersourcing multiple-choice questions, creating
explanations for the generated questions is a crucial step as it facilitates a
deeper understanding of the related concepts. However, it is often difficult
for students to craft effective explanations due to limited subject
understanding and a tendency to merely restate the question stem, distractors,
and correct answer. To help scaffold this task, in this work we propose a
self-reinforcement large-language-model framework, with the goal of generating
and evaluating explanations automatically. Comprising three modules, the
framework generates student-aligned explanations, evaluates these explanations
to ensure their quality and iteratively enhances the explanations. If an
explanation's evaluation score falls below a defined threshold, the framework
iteratively refines and reassesses the explanation. Importantly, our framework
emulates the manner in which students compose explanations at the relevant
grade level. For evaluation, we had a human subject-matter expert compare the
explanations generated by students with the explanations created by the
open-source large language model Vicuna-13B, a version of Vicuna-13B that had
been fine-tuned using our method, and by GPT-4. We observed that, when compared
to other large language models, GPT-4 exhibited a higher level of creativity in
generating explanations. We also found that explanations generated by GPT-4
were ranked higher by the human expert than both those created by the other
models and the original student-created explanations. Our findings represent a
significant advancement in enriching the learnersourcing experience for
students and enhancing the capabilities of large language models in educational
applications.Comment: Preprint. Under revie
- …